Objective

The objective of the task is to analyse the pricing of car insurance premiums of Suncorp against its three competitors. The dataset consists of factors that affect the pricing of insurance premiums (e.g vehicle details, owner / driver information, the details of other insurances on the vehicle, etc.)

Process flow :


Data Cleaning

As a first step, we will carry out some basic cleaning

Eliminate Duplication

  • The Policy Commencement field has only 2 dates and both have same count of records.
  • The sum,mean and standard deviation of PREMIUM columns also gives same values for all the insurers except for INSURER2
    • This could possibly mean that the three insurers (1,3 and 4) have not done a pricing review between the 2 dates
    • The average price increase for INSURER2 is around 2%
  • To avoid duplication we will only consider the 1st Sep 2019 records for our analysis
Type COMMENCEDATE INSURER1_PREMIUM INSURER2_PREMIUM INSURER3_PREMIUM INSURER4_PREMIUM
Count 2019-07-01 29752.00 29752.00 29752.00 29752.00
Count 2019-09-01 29752.00 29752.00 29752.00 29752.00
Sum 2019-07-01 7460723.80 8499325.77 8961579.87 6955641.12
Sum 2019-09-01 7460723.80 8659091.81 8961579.87 6955641.12
Mean 2019-07-01 250.76 285.67 301.21 233.79
Mean 2019-09-01 250.76 291.04 301.21 233.79
Std. Dev 2019-07-01 67.03 66.88 52.43 53.01
Std. Dev 2019-09-01 67.03 64.47 52.43 53.01

Columns with redundant information

  • These column pairs do not provide additional information, we will drop one of them from the dataset
    • The two columns ‘At fault in 5 years’ and ‘At fault in 2 years’ have same data distribution
    • Similarly the VEH AGE is derived using MANUFYEAR (with 2019 as base)
    • The MVINSYEARS column also has a high correlation with NRMASST colum

Combine rare levels

  • The distribution of Car MAKE is top heavy, ie the top 13 car companies contribute towards 90% of the total records (volume and not value), so we will retain only these 13 MAKEs and merge the rest as ‘OTHERS’
  • And then keep top 3 models under each of the MAKEs
  • This is done as majority of the ML models do not handle high number of data levels. Also, by merging these rare levels we do not ‘expect’ to lose much information
  • BEFORE : MAKE count: 58 and MODEL count : 631

  • AFTER : MAKE count: 14 and MODEL count : 43

Other Data Manipulations

  • We will convert the character columns to factors
  • And also few other manipulations for readbility and to maintain consistency

Exploratory Analysis

Now lets look at the individual feautures

Numeric Features

  • The three numeric ‘age’ features have an almost normal distribution
    • This is expected as majority of the driving population is in their mid-ages
    • The record count of vehicle age is nearly constant from 4 to 12 years, after which it starts dropping. This probably means that new cars are insured by a different providers or it could also mean that people have switched their insurance companies for their existing older cars. This would have been more clear if the volumes were provided.
    • For modeling purposes, these columns could be converted to bins but for now we will leave them as is

Categorical Features

  • The below charts gives the distrbution of the categorical features
    • Due to the rare-level merging that we did earlier the ‘OTHERS’ category in the MODEL feature now has high frequency of records
    • The LICCANCNT has a very low proportion (0.0043) of records and could ideally be ignored. But since it contains the conviction/suspension information we will retain it.

Feature Correlations

  • We do not see much collinearity in the feature set as we have already done some cleaning in the previous steps
  • We can see that the Owner Age and Youngest driver age is highly correlated, possibly an indication that the majority of the car owners are the sole drivers of the vehicle
  • The MVINSTYPE = N has a high correlaion with MVINSURER= NONE, but we will not treat it
  • Assuming the Insurance Premiums to be the target variables, we notice they have slight correlations with the feature set. We will explore this further in the feature importance section


Premium price analysis

In this section, we will take a deeper look at how the four insurers are pricing their products?

Frequency Distribution

  • The premium prices for all the insurers are in the range of 190-390
    INSURER1_PREMIUM INSURER2_PREMIUM INSURER3_PREMIUM INSURER4_PREMIUM
    Min. :197.1 Min. :192.9 Min. :202.0 Min. :194.3
    1st Qu.:209.3 1st Qu.:239.8 1st Qu.:262.0 1st Qu.:206.3
    Median :219.0 Median :277.4 Median :295.3 Median :211.1
    Mean :250.8 Mean :291.0 Mean :301.2 Mean :233.8
    3rd Qu.:247.9 3rd Qu.:367.9 3rd Qu.:338.9 3rd Qu.:225.5
    Max. :386.4 Max. :385.2 Max. :388.2 Max. :385.3
  • Below box plots shows their distributions. We can see that:
    • The premium prices of Insurer 2 and 3 are higher on average
      • Insurer 2 has a wider IQR, which is a good strategy as their prices would provide more/customised options to their customers. This could possibly result in higher revenue
      • Insurer 3 has the highest median price
    • Insurer 1 and 4 are similar in their pricing
      • They both have a median price of around 210-220, which is much lower than their competitors.
      • And they also have outliers

  • Looking at histograms
    • It looks like the prices of Insurer 2 and 3 are capped to an upper limit, probably as a result of some outlier treatment
    • The frequency distribution of Insurer 1 and 4 is not normal which indicates they have a pricing strategy which is based on price slabs. (ie the pricing is done in steps)
    • As compared to the pricing of Insurer 2 and 3, which is more dynamic and smooth

Feature Importance

  • Now, we will look at the factors that each INSURER uses to arrive at their premium pricing
    • We could have used tree algorithms like RandomForest or Xgboost as they have give feature importance
    • But we will use the entropy and information gain algorithm from the FSelectorRcpp package to determine the feature importance. Source
  • We can see that INSURER 1 and 4 give high weightage to the age features to arrive at their pricing, compared to the others
  • Alongwith the age column, the pricing of Insurer 2 also considers features like Demerit points and average annual kilometer
Table gives the top 10 features for each Insurer (MAKE and MODEL columns were ignored)
name INSURER1_PREMIUM INSURER2_PREMIUM INSURER3_PREMIUM INSURER4_PREMIUM
YDAGE 0.42 0.15 0.18 0.28
OWNERAGE 0.27 0.10 0.09 0.38
VEHAGE 0.19 0.17 0.16 0.30
NOYRLICOBT.Low 0.17 0.10 NA NA
MVINSTYPE.N 0.06 0.03 0.06 NA
MVINSURER.NONE 0.06 0.03 0.06 NA
ATFAULT5YRS.Y 0.03 NA 0.03 NA
NOYRLICOBT.Med 0.03 NA 0.03 0.03
SHAPE.WAG 0.02 NA NA NA
NCBPCT.60 0.01 NA NA 0.05
COMREGVEH.Y NA NA NA 0.06
DEMERITPTS.Low NA 0.09 NA 0.13
DEMERITPTS.No NA 0.12 NA 0.17
MVINSTYPE.T NA 0.04 0.04 NA
MVINSURER.NRMA NA NA 0.03 NA
NRMAASST.5 NA NA 0.04 NA
REGUSE.PRIV NA NA NA 0.03
YEARLYKM.15 NA 0.02 NA 0.05
  • Below are the variable importance plots for all the four insurers (features not sorted)

Pricing of YDAGE column

From the above analysis we can see that YDAGE is an important feature. Lets look into more details

  • Below boxplots shows how the different insurers price their products for different ‘Youngest Driver’ age groups
    • If we compare the distribution of Insurer 1 and 4 with that of 2 and 3, we can see that Insurer 2 and 3 have much smoother price curves
    • Insurer 1 and 4 charge uniformly for different age groups
  • Similar plots for other top features (not shown) also indicates that the pricing strategy of Insurer1 is not dynamic
    • All the remaining features can be explored using the Shiny App hosted on shiny.io


Pricing Model

We will build a very basic pricing model for Insurer 1 using XGBoost algorithm

Process

  • We will use the prices of Insurer1’s competitors as Target variable
    • The model learns from the pricing strategies applied by the competitors and comes out with the prices that can be used by Insurer1
    • Insurer1 could use the predicted prices to conduct an A/B testing and measure the impact on its sales and revenue

Model Evaluation

  • If we look at the below boxplots, we can see that the model (INSURER1_NEW plot) predicts a median price which is higher than its existing price and lower than Insurer 2 and 3
  • The distribution of the prices is approximately normal as seen in the histogram

Limitations

  • Prior Sales and Market Demand are the key factors for product pricing. Vehicle insurance pricing also relies heavily on factors like past accidents and claims, none of which were available for modeling
  • Key model improvement strategies like feature engineering, resampling, hyperparameter tuning, model ensembling, etc were not utilised
  • The pricing structure predicted by the model is better than the one currently used by Insurer1, but the model itself is nothing better than an average model and only included here as a POC


Summary